h2m-cli 0.3.2

HTML to Markdown converter.

H2M

Fast, extensible HTML-to-Markdown converter for Rust — CommonMark + GFM, plugin architecture, zero unsafe.

H2M converts HTML into clean Markdown with full CommonMark compliance and GitHub Flavored Markdown extensions. It uses a plugin-based rule system, supports reference-style links and relative URL resolution, and ships with an async CLI powered by tokio for high-concurrency batch fetching.

Quick Start

Install the CLI

Shell (macOS / Linux):

curl -fsSL https://sh.qntx.fun/labs/h2m | sh

PowerShell (Windows):

irm https://sh.qntx.fun/labs/h2m/ps | iex

Or via Cargo:

cargo install h2m-cli

CLI Usage

# Convert a URL directly
h2m https://example.com

# Extract only the article content
h2m --selector article https://blog.example.com/post

# Smart readable extraction (strips nav, footer, aside, etc.)
h2m --readable https://blog.example.com/post
# Short form
h2m -r https://blog.example.com/post

# Local file with GFM + referenced links, save to file
h2m --gfm --link-style referenced page.html -o output.md

# Pipe from stdin
curl -s https://example.com | h2m --selector main

# JSON output for programmatic / agent consumption
h2m --json https://example.com

# Batch convert multiple URLs (NDJSON streaming output)
h2m --json url1 url2 url3

# Batch from file with concurrency control
h2m --json --urls urls.txt -j 8 --delay 100

# Custom User-Agent
h2m --user-agent "MyBot/1.0" https://example.com

# All formatting options
h2m --gfm --heading-style setext --strong underscores --fence tilde page.html

JSON Output

A single URL produces a pretty-printed JSON object:

{
  "url": "https://example.com",
  "domain": "example.com",
  "status_code": 200,
  "content_type": "text/html; charset=UTF-8",
  "title": "Example Domain",
  "language": "en",
  "description": "This domain is for use in illustrative examples.",
  "markdown": "# Example Domain\n\n...",
  "elapsed_ms": 234,
  "content_length": 1256
}

Multiple URLs produce NDJSON (one JSON object per line), ideal for streaming pipelines.
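
NDJSON streams well because no record contains a raw newline, so a consumer can parse each line as soon as it arrives. The following is a minimal, dependency-free sketch of that property, not h2m's serializer: `Record`, `escape_json`, and `to_ndjson_line` are hypothetical names, and only three of the fields shown above are modelled.

```rust
/// Hypothetical, trimmed-down record; the real JSON output has more fields.
struct Record {
    url: String,
    status_code: u16,
    markdown: String,
}

/// Escape a string for use inside a JSON string literal.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Serialize one record as a single NDJSON line (no embedded newlines).
fn to_ndjson_line(r: &Record) -> String {
    format!(
        "{{\"url\":\"{}\",\"status_code\":{},\"markdown\":\"{}\"}}",
        escape_json(&r.url),
        r.status_code,
        escape_json(&r.markdown)
    )
}
```

Because newlines inside `markdown` are escaped as `\n`, a reader can split the stream on line breaks without a streaming JSON parser.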

Library Usage

// One-liner with CommonMark defaults
let md = h2m::convert("<h1>Hello</h1><p>World</p>");
assert_eq!(md, "# Hello\n\nWorld");

// Full control with builder
use h2m::{Converter, Options};
use h2m::plugins::Gfm;
use h2m::rules::CommonMark;

let converter = Converter::builder()
    .options(Options::default())
    .use_plugin(CommonMark)
    .use_plugin(Gfm)
    .domain("example.com")
    .build();

let md = converter.convert(r#"<a href="/about">About</a>"#);
assert_eq!(md, "[About](https://example.com/about)");
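
The builder example relies on relative URL resolution: with a domain set, `/about` becomes an absolute link. Here is a rough, dependency-free sketch of that idea. `resolve_href` and `inline_link` are hypothetical helpers, the `https` scheme is an assumption taken from the output above, and `../` segments are deliberately not normalised; h2m's own resolver may behave differently.

```rust
/// Resolve `href` against a bare domain such as "example.com".
/// Simplified: absolute URLs, fragments, and mailto links pass through;
/// protocol-relative, root-relative, and bare paths are made absolute.
fn resolve_href(domain: &str, href: &str) -> String {
    if href.starts_with("http://") || href.starts_with("https://") {
        href.to_string()
    } else if let Some(rest) = href.strip_prefix("//") {
        // Protocol-relative URL: assume https.
        format!("https://{rest}")
    } else if href.starts_with('#') || href.starts_with("mailto:") {
        href.to_string()
    } else if let Some(path) = href.strip_prefix('/') {
        format!("https://{domain}/{path}")
    } else {
        format!("https://{domain}/{href}")
    }
}

/// Render an inline Markdown link with a resolved destination.
fn inline_link(text: &str, domain: &str, href: &str) -> String {
    format!("[{text}]({})", resolve_href(domain, href))
}
```

Calling `inline_link("About", "example.com", "/about")` yields the same string as the assertion in the builder example.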

Async Fetching

Enable the fetch feature for async HTTP fetching with built-in concurrency control, rate limiting, and streaming output:

use h2m::fetch::Fetcher;

let fetcher = Fetcher::builder()
    .concurrency(8)
    .gfm(true)
    .extract_links(true)
    .build()?;

// Single fetch
let result = fetcher.fetch("https://example.com").await?;
println!("{}", result.markdown);

// Batch with streaming callback
let urls = vec!["https://a.com".into(), "https://b.com".into()];
fetcher.fetch_many_streaming(&urls, |result| {
    match result {
        Ok(r) => println!("{}", r.markdown),
        Err(e) => eprintln!("error: {e}"),
    }
}).await;
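
The `concurrency(8)` setting caps how many fetches are in flight at once. To illustrate the semaphore idea without any dependencies, the sketch below swaps in std threads and a hand-rolled counting semaphore for tokio's async one. `Semaphore` and `run_bounded` are hypothetical names, and a short sleep stands in for the HTTP request.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Minimal counting semaphore built from a mutex and a condition variable.
struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Self { permits: Mutex::new(permits), available: Condvar::new() }
    }

    /// Block until a permit is free, then take it.
    fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    /// Return a permit and wake one waiter.
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
    }
}

/// Run `jobs` simulated fetches with at most `limit` in flight.
/// Returns the highest number of jobs observed running at once.
fn run_bounded(jobs: usize, limit: usize) -> usize {
    let sem = Arc::new(Semaphore::new(limit));
    let running = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..jobs)
        .map(|_| {
            let (sem, running, peak) = (sem.clone(), running.clone(), peak.clone());
            thread::spawn(move || {
                sem.acquire();
                let now = running.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                thread::sleep(Duration::from_millis(10)); // stand-in for an HTTP request
                running.fetch_sub(1, Ordering::SeqCst);
                sem.release();
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}
```

With `run_bounded(12, 3)`, the returned peak never exceeds 3, because a job only counts as running while it holds a permit.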

Design

  • CommonMark + GFM: full spec compliance with tables, strikethrough, task lists, reference-style links
  • Plugin architecture: extend with custom rules via the Rule trait
  • Async batch pipeline: tokio + reqwest, semaphore concurrency, streaming NDJSON (feature-gated)
  • JSON output: structured result with rich metadata (status, language, description, og:image) for agent/programmatic consumption
  • Smart readable extraction: two-phase content detection: semantic selectors → noise stripping (nav, footer, aside, header, ARIA roles)
  • Smart fetching: configurable User-Agent, HTML meta-refresh redirect following
  • Zero-copy fast paths: Cow<str> escaping, zero unsafe, Send + Sync
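
The `Cow<str>` fast path in the last bullet can be sketched as follows: when the input contains nothing to escape, the function hands back the borrowed input and allocates nothing. This is only an illustration under assumptions. `escape_markdown` and the `SPECIAL` character set are hypothetical, and a real converter decides what to escape based on context.

```rust
use std::borrow::Cow;

/// Characters escaped by this sketch (an assumed, deliberately small set).
const SPECIAL: &[char] = &['\\', '*', '_', '`', '[', ']'];

/// Backslash-escape Markdown-significant characters.
/// Returns the input unchanged (borrowed) when nothing needs escaping.
fn escape_markdown(input: &str) -> Cow<'_, str> {
    // Fast path: no special characters, so no allocation.
    let first = match input.find(SPECIAL) {
        Some(index) => index,
        None => return Cow::Borrowed(input),
    };

    // Slow path: copy the clean prefix, then escape the remainder.
    let mut out = String::with_capacity(input.len() + 8);
    out.push_str(&input[..first]);
    for c in input[first..].chars() {
        if SPECIAL.contains(&c) {
            out.push('\\');
        }
        out.push(c);
    }
    Cow::Owned(out)
}
```

Most text nodes in ordinary prose contain no special characters, so the borrowed branch is the common case.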

Conversion Examples

Input HTML:

<h1>Title</h1>
<p>A <strong>bold</strong> and <em>italic</em> paragraph with <a href="https://example.com">a link</a>.</p>
<ul>
  <li>First item</li>
  <li>Second item</li>
</ul>
<pre><code class="language-rust">fn main() {}</code></pre>

Output Markdown:

# Title

A **bold** and *italic* paragraph with [a link](https://example.com).

- First item
- Second item

```rust
fn main() {}
```

Supported HTML Elements

CommonMark (built-in)

Element                           Markdown Output
<h1>-<h6>                         # Heading (ATX) or underline (Setext)
<p>, <div>, <section>, <article>  Block paragraph
<strong>, <b>                     **bold**
<em>, <i>                         *italic*
<code>, <kbd>, <samp>, <tt>       `inline code`
<pre><code>                       Fenced code block with language detection
<a href="...">                    [text](url) or reference-style
<img src="..." alt="...">         ![alt](src "title")
<ul>, <ol>, <li>                  Bullet/numbered lists with nesting
<blockquote>                      > quoted text
<hr>                              ---
<br>                              Hard line break
<iframe>                          [iframe](url)
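
The heading row above mentions two output styles. The sketch below shows how they differ under CommonMark, where Setext underlines exist only for levels 1 and 2, so deeper levels fall back to ATX. `HeadingStyle` and `render_heading` are hypothetical names, and matching the underline length to the text is a cosmetic assumption (CommonMark accepts any length of at least one character).

```rust
#[derive(Clone, Copy)]
enum HeadingStyle {
    Atx,
    Setext,
}

/// Render a heading of the given level (clamped to 1..=6).
fn render_heading(level: usize, text: &str, style: HeadingStyle) -> String {
    let level = level.clamp(1, 6);
    // Underline as wide as the text, but never empty.
    let width = text.chars().count().max(1);
    match (style, level) {
        (HeadingStyle::Setext, 1) => format!("{text}\n{}", "=".repeat(width)),
        (HeadingStyle::Setext, 2) => format!("{text}\n{}", "-".repeat(width)),
        // ATX style, or a Setext request for level 3 and deeper.
        _ => format!("{} {text}", "#".repeat(level)),
    }
}
```

For example, `render_heading(1, "Title", HeadingStyle::Setext)` produces `Title` followed by a line of five `=` characters, while the ATX style produces `# Title`.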

GFM Extensions (with --gfm)

Element                  Markdown Output
<table>                  GFM pipe table with alignment
<del>, <s>, <strike>     ~~strikethrough~~
<input type="checkbox">  [x] or [ ] (task list)
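
To show what a GFM pipe table with alignment looks like once emitted, here is a small dependency-free sketch. `Align`, `delimiter`, `render_row`, and `render_table` are hypothetical names, the caller is assumed to pass one alignment per header cell, and cells are not padded to equal width; h2m's actual table output may differ in spacing.

```rust
#[derive(Clone, Copy)]
enum Align {
    Default,
    Left,
    Center,
    Right,
}

/// Delimiter-row cell for a column; colons mark the alignment.
fn delimiter(align: Align) -> &'static str {
    match align {
        Align::Default => "---",
        Align::Left => ":--",
        Align::Center => ":-:",
        Align::Right => "--:",
    }
}

/// Render one table row, escaping pipes so they are not read as separators.
fn render_row(cells: &[&str]) -> String {
    let escaped: Vec<String> = cells.iter().map(|c| c.replace('|', "\\|")).collect();
    format!("| {} |", escaped.join(" | "))
}

/// Render header, delimiter row, and body rows as a GFM pipe table.
fn render_table(headers: &[&str], aligns: &[Align], rows: &[Vec<&str>]) -> String {
    let mut lines = vec![render_row(headers)];
    let delims: Vec<&str> = aligns.iter().map(|a| delimiter(*a)).collect();
    lines.push(format!("| {} |", delims.join(" | ")));
    for row in rows {
        lines.push(render_row(row));
    }
    lines.join("\n")
}
```

The delimiter row is what distinguishes a GFM table from ordinary text, so it is emitted even when every column uses the default alignment.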

Auto-removed

Element     Behavior
<script>    Removed (content stripped)
<style>     Removed (content stripped)
<noscript>  Removed (content stripped)

Custom Rules

Extend the converter with your own rules by implementing the Rule trait:

use h2m::{Converter, Rule, Action, Context};
use h2m::rules::CommonMark;
use scraper::ElementRef;

struct HighlightRule;

impl Rule for HighlightRule {
    fn tags(&self) -> &'static [&'static str] { &["mark"] }

    fn apply(&self, content: &str, _el: &ElementRef<'_>, _ctx: &mut Context) -> Action {
        Action::Replace(format!("=={content}=="))
    }
}

let mut builder = Converter::builder()
    .use_plugin(CommonMark);
builder.add_rule(HighlightRule);
let converter = builder.build();

let md = converter.convert("<p>This is <mark>important</mark></p>");
assert!(md.contains("==important=="));
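
For readers who want the dispatch idea without the crate, the following dependency-free model maps tag names to rules and lets each rule wrap already-converted content. It is a simplified stand-in, not h2m's API: `SimpleRule` and `Registry` are hypothetical, there is no `Action` or `Context`, and letting a later rule override an earlier one for the same tag is an assumption.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a conversion rule (not h2m's actual `Rule` trait).
trait SimpleRule {
    fn tags(&self) -> &'static [&'static str];
    fn apply(&self, content: &str) -> String;
}

struct Highlight;

impl SimpleRule for Highlight {
    fn tags(&self) -> &'static [&'static str] { &["mark"] }
    fn apply(&self, content: &str) -> String { format!("=={content}==") }
}

struct Strong;

impl SimpleRule for Strong {
    fn tags(&self) -> &'static [&'static str] { &["strong", "b"] }
    fn apply(&self, content: &str) -> String { format!("**{content}**") }
}

/// Tag-keyed rule table; unknown tags pass their content through unchanged.
struct Registry {
    by_tag: HashMap<&'static str, usize>,
    rules: Vec<Box<dyn SimpleRule>>,
}

impl Registry {
    fn new() -> Self {
        Self { by_tag: HashMap::new(), rules: Vec::new() }
    }

    /// Register a rule for every tag it declares (later rules win).
    fn add(&mut self, rule: Box<dyn SimpleRule>) {
        let index = self.rules.len();
        for tag in rule.tags() {
            self.by_tag.insert(*tag, index);
        }
        self.rules.push(rule);
    }

    /// Apply the rule registered for `tag`, if any.
    fn render(&self, tag: &str, content: &str) -> String {
        match self.by_tag.get(tag) {
            Some(&index) => self.rules[index].apply(content),
            None => content.to_string(),
        }
    }
}
```

After registering `Strong` and `Highlight`, rendering the tag `mark` with content `important` gives `==important==`, matching the assertion in the custom rule example.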

License

Licensed under either of:

at your option.

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this project shall be dual-licensed as above, without any additional terms or conditions.


A QNTX open-source project.

Code is law. We write both.